Abstract

We introduce a class of depth-based classification procedures that are of
a nearest-neighbor nature. Depth, after symmetrization, indeed provides the
center-outward ordering that is necessary and sufficient to define nearest
neighbors. The resulting classifiers are affine-invariant and inherit the
nonparametric validity from nearest-neighbor classifiers. In particular, we prove
that the proposed depth-based classifiers are consistent under very mild conditions.
We investigate their finite-sample performances through simulations and show that
they outperform affine-invariant nearest-neighbor classifiers obtained through
an obvious standardization construction. We illustrate the practical value of
our classifiers on two real data examples. Finally, we briefly discuss the possible
uses of our depth-based neighbors in other inference problems.
1 INTRODUCTION



The main focus of this work is on the standard classification setup in which the
observation, of the form $(X, Y)$, is a random vector taking values in
$\mathbb{R}^d \times \{0, 1\}$. A classifier is a function $m : \mathbb{R}^d \to \{0, 1\}$
that associates with any value $x$ a predictor for the corresponding "class" $Y$.
Denoting by $I[A]$ the indicator function of the set $A$, the so-called Bayes
classifier, defined through

$$m_{\mathrm{Bayes}}(x) = I[\eta(x) > 1/2], \quad \text{with } \eta(x) = P[Y = 1 \mid X = x], \qquad (1.1)$$

is optimal in the sense that it minimizes the probability of misclassification
$P[m(X) \neq Y]$. Under absolute continuity assumptions, the Bayes rule rewrites

$$m_{\mathrm{Bayes}}(x) = I\left[ \frac{f_1(x)}{f_0(x)} > \frac{\pi_0}{\pi_1} \right], \qquad (1.2)$$

where $\pi_j = P[Y = j]$ and $f_j$ denotes the pdf of $X$ conditional on $[Y = j]$. Of
course, empirical classifiers $\hat{m}^{(n)}$ are obtained from i.i.d. copies $(X_i, Y_i)$,
$i = 1, \ldots, n$, of $(X, Y)$, and it is desirable that such classifiers are consistent,
in the sense that, as $n \to \infty$, the probability of misclassification of $\hat{m}^{(n)}$,
conditional on $(X_i, Y_i)$, $i = 1, \ldots, n$, converges in probability to the probability
of misclassification of the Bayes rule. If this convergence holds irrespective of the
distribution of $(X, Y)$, the consistency is said to be universal.

Classically, parametric approaches assume that the conditional distribution of $X$
given $[Y = j]$ is multinormal with mean $\mu_j$ and covariance matrix $\Sigma_j$ ($j = 0, 1$).
This gives rise to the so-called quadratic discriminant analysis (QDA), or to linear
discriminant analysis (LDA) if it is further assumed that $\Sigma_0 = \Sigma_1$. It is standard
to estimate the parameters $\mu_j$ and $\Sigma_j$ ($j = 0, 1$) by the corresponding sample means
and empirical covariance matrices, but the use of more robust estimators was
recommended in many works; see, e.g., Randles et al. (1978), He and Fung (2000), Dehon
and Croux (2001), or Hartikainen and Oja (2006). Irrespective of the estimators used,
however, these classifiers fail to be consistent away from the multinormal case.
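
The equivalence between the two forms (1.1) and (1.2) of the Bayes rule can be checked numerically. The sketch below is only an illustration: the two univariate normal densities, the equal priors and all function names are assumptions chosen for the example, not quantities taken from this paper.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of the univariate normal distribution N(mean, sd^2)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def eta(x, f0, f1, pi0, pi1):
    """Posterior probability P[Y = 1 | X = x], obtained from Bayes' formula."""
    return pi1 * f1(x) / (pi0 * f0(x) + pi1 * f1(x))

def bayes_rule_eta(x, f0, f1, pi0=0.5, pi1=0.5):
    """Rule (1.1): class 1 iff eta(x) > 1/2."""
    return int(eta(x, f0, f1, pi0, pi1) > 0.5)

def bayes_rule_ratio(x, f0, f1, pi0=0.5, pi1=0.5):
    """Rule (1.2), written without division: class 1 iff pi1 f1(x) > pi0 f0(x)."""
    return int(pi1 * f1(x) > pi0 * f0(x))

# Hypothetical example: N(0, 1) against N(2, 1) with equal priors,
# for which the decision boundary sits at x = 1.
f0 = lambda x: normal_pdf(x, 0.0, 1.0)
f1 = lambda x: normal_pdf(x, 2.0, 1.0)
points = (-1.0, 0.9, 1.1, 3.0)
labels_eta = [bayes_rule_eta(x, f0, f1) for x in points]
labels_ratio = [bayes_rule_ratio(x, f0, f1) for x in points]
```

Both rules return the same labels here because $\eta(x) > 1/2$ holds exactly when $\pi_1 f_1(x) > \pi_0 f_0(x)$.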

Denoting by $d_{\Sigma}(x, \mu) = ((x - \mu)' \Sigma^{-1} (x - \mu))^{1/2}$ the Mahalanobis
distance between $x$ and $\mu$ in the metric associated with the symmetric and positive
definite matrix $\Sigma$, it is well known that the QDA classifier rewrites

$$m_{\mathrm{QDA}}(x) = I[d_{\Sigma_1}(x, \mu_1) < d_{\Sigma_0}(x, \mu_0) + C], \qquad (1.3)$$

where the constant $C$ depends on $\Sigma_0$, $\Sigma_1$, and $\pi_0$, hence classifies $x$ into
Population 1 if it is sufficiently more central in Population 1 than in Population 0
(centrality, in elliptical setups, being therefore measured with respect to the geometry
of the underlying equidensity contours). This suggests that statistical depth functions,
that is, mappings of the form $x \mapsto D(x, P)$ indicating how central $x$ is with respect
to a probability measure $P$ (see Section 2.1 for a more precise definition), are
appropriate tools to perform nonparametric classification. Indeed, denoting by $P_j$ the
probability measure associated with Population $j$ ($j = 0, 1$), (1.3) makes it natural to
consider classifiers of the form

$$m_D(x) = I[D(x, P_1) > D(x, P_0)],$$

based on some fixed statistical depth function $D$. This max-depth approach was first
proposed in Liu et al. (1999) and was then investigated in Ghosh and Chaudhuri
(2005b). Dutta and Ghosh (2012a,b) considered max-depth classifiers based on the
projection depth and on (an affine-invariant version of) the $L_p$ depth, respectively.
Hubert and Van der Veeken (2010) modified the max-depth approach based on
projection depth to better cope with possibly skewed data.
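
As an illustration of the max-depth rule above, the following Python sketch classifies a point by comparing its sample Mahalanobis depth in each training sample. The function names, the simulated data, and the use of the sample mean and covariance matrix as location and scatter estimates are assumptions made for this example; the works cited above may rely on other depths and estimators.

```python
import numpy as np

def mahalanobis_depth(x, sample):
    """Sample Mahalanobis depth 1 / (1 + d^2), with d the Mahalanobis distance
    of x to the sample mean in the metric of the sample covariance matrix."""
    mu = sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)
    diff = np.asarray(x, dtype=float) - mu
    d2 = diff @ np.linalg.solve(cov, diff)
    return 1.0 / (1.0 + d2)

def max_depth_classify(x, sample0, sample1):
    """Return 1 if x is strictly deeper in sample1 than in sample0, else 0."""
    return int(mahalanobis_depth(x, sample1) > mahalanobis_depth(x, sample0))

# Hypothetical training samples: two well-separated bivariate normal clouds.
rng = np.random.default_rng(0)
sample0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
sample1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(200, 2))
label_near_0 = max_depth_classify([0.0, 0.0], sample0, sample1)
label_near_1 = max_depth_classify([4.0, 4.0], sample0, sample1)
```

The rule is global: each depth is computed with respect to a whole population, which is the feature that the local construction of Section 2.2 is designed to avoid.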

Recently, Li et al. (2012) proposed the "Depth vs Depth" (DD) classifiers that
extend the max-depth ones by constructing appropriate polynomial separating curves
in the DD-plot, that is, in the scatter plot of the points
$(D_0^{(n)}(X_i), D_1^{(n)}(X_i))$, $i = 1, \ldots, n$, where $D_j^{(n)}(X_i)$ refers to the
depth of $X_i$ with respect to the data points coming from Population $j$. Those separating
curves are chosen to minimize the empirical misclassification rate on the training sample
and their polynomial degree $m$ is chosen through cross-validation. Lange et al. (2012)
defined modified DD-classifiers that are computationally efficient and apply in higher
dimensions (up to $d = 20$). Other depth-based classifiers were proposed in Jörnsten
(2004), Ghosh and Chaudhuri (2005a), and Cui et al. (2008).

Being based on depth, these classifiers are clearly of a nonparametric nature. An
important requirement in nonparametric classification, however, is that consistency
holds as broadly as possible and, in particular, does not require "structural"
distributional assumptions. In that respect, the depth-based classifiers available in the
literature are not so satisfactory, since they are at best consistent under elliptical
distributions only¹. This restricted-to-ellipticity consistency implies that, as far as
consistency is concerned, the Mahalanobis depth is perfectly sufficient and is by no
means inferior to the "more nonparametric" (Tukey (1975)) halfspace depth or (Liu
(1990)) simplicial depth, despite the fact that it uninspiringly leads to LDA through
the max-depth approach. Also, even this restricted consistency often requires
estimating densities; see, e.g., Dutta and Ghosh (2012a,b). This is somewhat undesirable
since density and depth are quite antinomic in spirit (a deepest point may very well
be a point where the density vanishes). Actually, if densities are to be estimated in
the procedure anyway, then it would be more natural to go for density estimation all
the way, that is, to plug density estimators in (1.2).

¹ The classifiers from Dutta and Ghosh (2012b) are an exception that slightly extends
consistency to (a subset of) the class of $L_p$-elliptical distributions.

The poor consistency of the available depth-based classifiers actually follows from
their global nature. Zakai and Ritov (2009) indeed proved that any universally
consistent classifier needs to be of a local nature. In this paper, we therefore introduce
local depth-based classifiers that rely on nearest-neighbor ideas (kernel density
techniques should be avoided, since, as mentioned above, depth and densities are somewhat
incompatible). From their nearest-neighbor nature, they will inherit consistency under
very mild conditions, while from their depth nature, they will inherit affine-invariance
and robustness, two important features in multivariate statistics and in classification
in particular. Identifying nearest neighbors through depth will be achieved via an
original symmetrization construction. The corresponding depth-based neighborhoods
are of a nonparametric nature and the good finite-sample behavior of the resulting
classifiers most likely results from their data-driven adaptive nature.

The outline of the paper is as follows. In Section 2, we first recall the concept
of statistical depth functions (Section 2.1) and then describe our symmetrization
construction, which allows us to define the depth-based neighbors to be used later for
classification purposes (Section 2.2). In Section 3, we define the proposed depth-based
nearest-neighbor classifiers and present some of their basic properties (Section 3.1)
before providing consistency results (Section 3.2). In Section 4, Monte Carlo
simulations are used to compare the finite-sample performances of our classifiers with
those of their competitors. In Section 5, we show the practical value of the proposed
classifiers on two real-data examples. We then discuss in Section 6 some further
applications of our depth-based neighborhoods. Finally, the Appendix collects the
technical proofs.

2 DEPTH-BASED NEIGHBORS 

In this section, we review the concept of statistical depth functions and define the 
depth-based neighborhoods on which the proposed nearest-neighbor classifiers will be 
based. 
2.1 Statistical depth functions

Statistical depth functions allow one to measure the centrality of any $x \in \mathbb{R}^d$
with respect to a probability measure $P$ over $\mathbb{R}^d$ (the larger the depth of $x$, the
more central $x$ is with respect to $P$). Following Zuo and Serfling (2000a), we define a
statistical depth function as a bounded mapping $D(\cdot, P)$ from $\mathbb{R}^d$ to
$\mathbb{R}^+$ that satisfies the following four properties:

(P1) affine-invariance: for any $d \times d$ invertible matrix $A$, any $d$-vector $b$ and any
distribution $P$ over $\mathbb{R}^d$, $D(Ax + b, P^{A,b}) = D(x, P)$, where $P^{A,b}$ is defined
through $P^{A,b}[B] = P[A^{-1}(B - b)]$ for any $d$-dimensional Borel set $B$;

(P2) maximality at center: for any $P$ that is symmetric about $\theta$ (in the sense²
that $P[\theta + B] = P[\theta - B]$ for any $d$-dimensional Borel set $B$),
$D(\theta, P) = \sup_{x \in \mathbb{R}^d} D(x, P)$;

(P3) monotonicity relative to the deepest point: for any $P$ having deepest point $\theta$,
$D(x, P) \leq D((1 - \lambda)\theta + \lambda x, P)$ for any $x \in \mathbb{R}^d$ and any
$\lambda \in [0, 1]$;

(P4) vanishing at infinity: for any $P$, $D(x, P) \to 0$ as $\|x\| \to \infty$.

For any statistical depth function and any $\alpha > 0$, the set
$R_\alpha(P) = \{x \in \mathbb{R}^d : D(x, P) \geq \alpha\}$ is called the depth region of
order $\alpha$. These regions are nested, and, clearly, inner regions collect points with
larger depth. Below, it will often be convenient to rather index these regions by their
probability content: for any $\beta \in [0, 1)$, we will denote by $R^\beta(P)$ the smallest
$R_\alpha(P)$ that has $P$-probability larger than or equal to $\beta$. Throughout,
subscripts and superscripts for depth regions are used for depth levels and probability
contents, respectively.

² Zuo and Serfling (2000a) also consider more general symmetry concepts; however, we
restrict in the sequel to central symmetry, which will be the right concept for our purposes.
Celebrated instances of statistical depth functions include

(i) the Tukey (1975) halfspace depth $D_H(x, P) = \inf_{u \in S^{d-1}} P[u'(X - x) \geq 0]$,
where $S^{d-1} = \{u \in \mathbb{R}^d : \|u\| = 1\}$ is the unit sphere in $\mathbb{R}^d$;

(ii) the Liu (1990) simplicial depth $D_S(x, P) = P[x \in S(X_1, X_2, \ldots, X_{d+1})]$, where
$S(x_1, x_2, \ldots, x_{d+1})$ denotes the closed simplex with vertices $x_1, x_2, \ldots, x_{d+1}$
and where $X_1, X_2, \ldots, X_{d+1}$ are i.i.d. $P$;

(iii) the Mahalanobis depth $D_M(x, P) = 1/(1 + d^2_{\Sigma(P)}(x, \mu(P)))$, for some
affine-equivariant location and scatter functionals $\mu(P)$ and $\Sigma(P)$;

(iv) the projection depth
$D_{Pr}(x, P) = 1/(1 + \sup_{u \in S^{d-1}} |u'x - \mu(P_{[u]})| / \sigma(P_{[u]}))$, where
$P_{[u]}$ denotes the probability distribution of $u'X$ when $X \sim P$ and where $\mu(P)$
and $\sigma(P)$ are univariate location and scale functionals, respectively.
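
To make definition (i) concrete, the sketch below approximates the sample halfspace depth of a point in dimension $d = 2$ by minimizing over a finite grid of unit directions instead of the whole unit circle. This is an approximation introduced here for illustration only (it can slightly overestimate the exact depth), and the function name and the grid size `n_dir` are assumptions.

```python
import numpy as np

def halfspace_depth_2d(x, sample, n_dir=360):
    """Approximate bivariate Tukey depth of x: minimum, over n_dir unit
    directions u, of the fraction of sample points with u'(X_i - x) >= 0."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_dir, endpoint=False)
    u = np.column_stack([np.cos(angles), np.sin(angles)])           # (n_dir, 2)
    proj = (np.asarray(sample, dtype=float) - np.asarray(x, dtype=float)) @ u.T
    return float((proj >= 0.0).mean(axis=0).min())                  # min over u

# Four points at the corners of a square: the center is the deepest point,
# while a point far outside the convex hull has depth zero.
corners = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
depth_center = halfspace_depth_2d([0.0, 0.0], corners)
depth_outside = halfspace_depth_2d([10.0, 10.0], corners)
```

The zero depth obtained outside the convex hull of the data is what later gives rise to the "outsider problem" discussed in Section 3.1.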

Other depth functions are the simplicial volume depth, the spatial depth, the $L_p$
depth, etc. Of course, not all such depths fulfill Properties (P1)-(P4) for any
distribution $P$; see Zuo and Serfling (2000a). A further concept of depth, of a slightly
different ($L_2$) nature, is the so-called zonoid depth; see Koshevoy and Mosler (1997).

Of course, if $d$-variate observations $X_1, \ldots, X_n$ are available, then sample versions
of the depths above are simply obtained by replacing $P$ with the corresponding empirical
distribution $P^{(n)}$ (the sample simplicial depth then has a $U$-statistic structure).

A crucial fact for our purposes is that a sample depth provides a center-outward
ordering of the observations with respect to the corresponding deepest point
$\hat{\theta}^{(n)}$: one may indeed order the $X_i$'s in such a way that

$$D(X_{(1)}, P^{(n)}) \geq D(X_{(2)}, P^{(n)}) \geq \ldots \geq D(X_{(n)}, P^{(n)}). \qquad (2.1)$$

Neglecting possible ties, this states that, in the depth sense, $X_{(1)}$ is the observation
closest to $\hat{\theta}^{(n)}$, $X_{(2)}$ the second closest, ..., and $X_{(n)}$ the one farthest
away from $\hat{\theta}^{(n)}$.

For most classical depths, there may be infinitely many deepest points, which form
a convex region in $\mathbb{R}^d$. This will not be an issue in this work, since the
symmetrization construction we will introduce, jointly with Properties (Q2)-(Q3) below,
asymptotically guarantees uniqueness of the deepest point. For some particular depth
functions, uniqueness may even hold for finite samples: for instance, in the case of
halfspace depth, it follows from Rousseeuw and Struyf (2004) and results on the uniqueness
of the symmetry center (Serfling (2006)) that, under the assumption that the parent
distribution admits a density, symmetrization implies almost sure uniqueness of the
deepest point.

2.2 Depth-based neighborhoods 

A statistical depth function, through (2.1), can be used to define neighbors of the
deepest point $\hat{\theta}^{(n)}$. Implementing a nearest-neighbor classifier, however,
requires defining neighbors of any point $x \in \mathbb{R}^d$. Property (P2) provides the key
to the construction of an $x$-outward ordering of the observations, hence to the definition
of depth-based neighbors of $x$: symmetrization with respect to $x$.

More precisely, we propose to consider depth with respect to the empirical
distribution $P_x^{(n)}$ associated with the sample obtained by adding to the original
observations $X_1, X_2, \ldots, X_n$ their reflections $2x - X_1, \ldots, 2x - X_n$ with respect
to $x$. Property (P2) implies that $x$ is the (unique, at least asymptotically; see above)
deepest point with respect to $P_x^{(n)}$. Consequently, this symmetrization construction,
parallel to (2.1), leads to an ($x$-outward) ordering of the form

$$D(X_{x,(1)}, P_x^{(n)}) \geq D(X_{x,(2)}, P_x^{(n)}) \geq \ldots \geq D(X_{x,(n)}, P_x^{(n)}).$$

Note that the reflected observations are only used to define the ordering but are not
ordered themselves. For any $k \in \{1, \ldots, n\}$, this allows us to identify, up to possible
ties, the $k$ nearest neighbors $X_{x,(i)}$, $i = 1, \ldots, k$, of $x$. In the univariate case
($d = 1$), these $k$ neighbors coincide, irrespective of the statistical depth function $D$,
with the $k$ data points minimizing the usual distances $|X_i - x|$, $i = 1, \ldots, n$.
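
The symmetrization construction can be sketched in a few lines of Python. The sketch assumes the Mahalanobis depth based on the sample mean and covariance matrix (any other depth could be passed through the hypothetical `depth` argument), ignores ties, and uses function names that are not taken from the paper. The small univariate example illustrates the statement that, for $d = 1$, the depth-based neighbors are the usual nearest neighbors.

```python
import numpy as np

def mahalanobis_depth(points, sample):
    """Mahalanobis depth of each row of `points` with respect to `sample`,
    using the sample mean and covariance matrix of `sample`."""
    mu = sample.mean(axis=0)
    cov = np.atleast_2d(np.cov(sample, rowvar=False))
    diff = points - mu
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return 1.0 / (1.0 + d2)

def depth_neighbors(x, sample, k, depth=mahalanobis_depth):
    """Indices of the k observations that are deepest with respect to the
    sample symmetrized about x (original points plus reflections 2x - X_i)."""
    x = np.asarray(x, dtype=float)
    symmetrized = np.vstack([sample, 2.0 * x - sample])
    depths = depth(sample, symmetrized)      # only original points are ordered
    order = np.argsort(-depths, kind="stable")
    return order[:k]

# Univariate check: the two depth-based neighbors of x = 2.4 are the two
# observations closest to 2.4 in the usual sense (here 3 and 1).
sample_1d = np.array([[0.0], [1.0], [3.0], [7.0], [10.0]])
neighbors_1d = depth_neighbors([2.4], sample_1d, k=2)
```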

In the sequel, the corresponding depth-based neighborhoods, that is, the sample
depth regions $R_{x,\alpha}^{(n)} = R_\alpha(P_x^{(n)})$, will play an important role. In
accordance with the notation from the previous section, we will write $R_x^{\beta(n)}$ for
the smallest depth region $R_{x,\alpha}^{(n)}$ that contains at least a proportion $\beta$ of
the data points $X_1, X_2, \ldots, X_n$. For $\beta = k/n$, $R_x^{\beta(n)}$ is therefore the
smallest depth-based neighborhood that contains $k$ of the $X_i$'s; ties may imply that the
number of data points in this neighborhood, $K_x^{\beta(n)}$ say, is strictly larger than $k$.

Note that a distance (or pseudo-distance) $(x, y) \mapsto d(x, y)$ that is symmetric in its
arguments is not needed to identify nearest neighbors of $x$. For that purpose, a
collection of "distances" $y \mapsto d_x(y)$ from a fixed point is indeed sufficient (in
particular, it is irrelevant whether or not this distance satisfies the triangle
inequality). In that sense, the (data-driven) symmetric distance associated with the Oja
and Paindaveine (2005) lift-interdirections, which was recently used to build
nearest-neighbor regression estimators in Biau et al. (2012), is unnecessarily strong.
Also, only an ordering of the "distances" is needed to identify nearest neighbors. This
ordering of distances from a fixed point $x$ is exactly what the depth-based $x$-outward
ordering above provides.



3 DEPTH-BASED kNN CLASSIFIERS

In this section, we first define the proposed depth-based classifiers and present some 
of their basic properties (Section 3.1). We then state the main result of this paper, 
related to their consistency properties (Section 3.2). 
3.1 Definition and basic properties

The standard $k$-nearest-neighbor ($k$NN) procedure classifies the point $x$ into
Population 1 iff there are more observations from Population 1 than from Population 0 in
the smallest Euclidean ball centered at $x$ that contains $k$ data points. Depth-based $k$NN
classifiers are naturally obtained by replacing these Euclidean neighborhoods with the
depth-based neighborhoods introduced above, that is, the proposed $k$NN procedure
classifies $x$ into Population 1 iff there are more observations from Population 1 than
from Population 0 in the smallest depth-based neighborhood of $x$ that contains $k$
observations, i.e., in $R_x^{\beta(n)}$ with $\beta = k/n$. In other words, the proposed
depth-based classifier is defined as

$$\hat{m}_D^{(n)}(x) = I\left[ \sum_{i=1}^n I[Y_i = 1] W_i^{\beta(n)}(x) > \sum_{i=1}^n I[Y_i = 0] W_i^{\beta(n)}(x) \right], \qquad (3.1)$$

with $W_i^{\beta(n)}(x) = \frac{1}{K_x^{\beta(n)}} I[X_i \in R_x^{\beta(n)}]$, where
$K_x^{\beta(n)} = \sum_{i=1}^n I[X_i \in R_x^{\beta(n)}]$ still denotes the number of
observations in the depth-based neighborhood $R_x^{\beta(n)}$. Since

$$\hat{m}_D^{(n)}(x) = I[\hat{\eta}_D^{(n)}(x) > 1/2], \quad \text{with } \hat{\eta}_D^{(n)}(x) = \sum_{i=1}^n I[Y_i = 1] W_i^{\beta(n)}(x), \qquad (3.2)$$

the proposed classifier is actually the one obtained by plugging, in (1.1), the
depth-based estimator $\hat{\eta}_D^{(n)}(x)$ of the conditional expectation $\eta(x)$. This
will be used in the proof of Theorem 3.1 below. Note that in the univariate case ($d = 1$),
$\hat{m}_D^{(n)}$, irrespective of the statistical depth function $D$, reduces to the standard
(Euclidean) $k$NN classifier.
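
Definitions (3.1) and (3.2) translate into the following Python sketch. It assumes the Mahalanobis depth computed from the sample mean and covariance matrix of the symmetrized sample, keeps every observation tied with the $k$-th deepest one in the neighborhood, and assigns exact ties in the vote to Population 0 as in (3.1). The function names and the simulated data are hypothetical.

```python
import numpy as np

def mahalanobis_depth(points, sample):
    """Mahalanobis depth of each row of `points` with respect to `sample`."""
    mu = sample.mean(axis=0)
    cov = np.atleast_2d(np.cov(sample, rowvar=False))
    diff = points - mu
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return 1.0 / (1.0 + d2)

def depth_knn_classify(x, X, y, k, depth=mahalanobis_depth):
    """Depth-based kNN rule (3.1), written in the plug-in form (3.2)."""
    x = np.asarray(x, dtype=float)
    symmetrized = np.vstack([X, 2.0 * x - X])   # sample symmetrized about x
    depths = depth(X, symmetrized)
    threshold = np.sort(depths)[-k]             # depth of the k-th deepest X_i
    in_region = depths >= threshold             # more than k points if ties
    eta_hat = y[in_region].mean()               # share of Population 1 labels
    return int(eta_hat > 0.5)

# Hypothetical training data: two bivariate normal clouds of 100 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([5.0, 5.0], 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)
label_a = depth_knn_classify([0.0, 0.0], X, y, k=5)
label_b = depth_knn_classify([5.0, 5.0], X, y, k=5)
```

Because the weights $W_i^{\beta(n)}(x)$ sum to one over the neighborhood, the estimate `eta_hat` exceeds 1/2 exactly when Population 1 holds the majority, which is why the two forms agree.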

It directly follows from Property (P1) that the proposed classifier is affine-invariant,
in the sense that the outcome of the classification will not be affected if $X_1, \ldots, X_n$
and $x$ are subject to a common (arbitrary) affine transformation. This clearly
improves over the standard $k$NN procedure that, e.g., is sensitive to unit changes. Of
course, one natural way to define an affine-invariant $k$NN classifier is to apply the
original $k$NN procedure on the standardized data points $\hat{\Sigma}^{-1/2} X_i$,
$i = 1, \ldots, n$, where $\hat{\Sigma}$ is an affine-equivariant estimator of shape, in the
sense that

$$\hat{\Sigma}(AX_1 + b, \ldots, AX_n + b) \propto A \hat{\Sigma}(X_1, \ldots, X_n) A'$$

for any invertible $d \times d$ matrix $A$ and any $d$-vector $b$. A natural choice for
$\hat{\Sigma}$ is the regular covariance matrix, but more robust choices, such as, e.g., the
shape estimators from Tyler (1987), Dümbgen (1998), or Hettmansperger and Randles (2002)
would make it possible to avoid any moment assumption. Here, we stress that, unlike our
adaptive depth-based methodology, such a transformation approach leads to neighborhoods
that do not exploit the geometry of the distribution in the vicinity of the point $x$ to
be classified (these neighborhoods indeed all are ellipsoids with $x$-independent
orientation and shape); as we show through simulations below, this results in
significantly worse performances.
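
The standardization approach just described can be sketched as follows. The sketch assumes the regular sample covariance matrix as shape estimator, so that Euclidean distances between standardized points are Mahalanobis distances between the original ones; the function name, the affine map and the simulated data are hypothetical. The final lines check affine invariance on a few query points.

```python
import numpy as np

def knn_aff_classify(x, X, y, k):
    """Euclidean kNN on data standardized by the sample covariance matrix,
    i.e. kNN in the Mahalanobis metric of the pooled training sample."""
    cov = np.cov(X, rowvar=False)
    diff = X - np.asarray(x, dtype=float)
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    nearest = np.argsort(d2, kind="stable")[:k]
    return int(y[nearest].mean() > 0.5)

# Hypothetical data and affine map x -> A x + b.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([2.0, 1.0], 1.0, size=(50, 2))])
y = np.repeat([0, 1], 50)
A = np.array([[2.0, 1.0], [0.0, 0.5]])
b = np.array([3.0, -1.0])
queries = rng.normal([1.0, 0.5], 1.5, size=(5, 2))

labels_original = [knn_aff_classify(q, X, y, k=7) for q in queries]
labels_transformed = [knn_aff_classify(A @ q + b, X @ A.T + b, y, k=7)
                      for q in queries]
```

The neighborhoods used here are ellipsoids whose orientation and shape are the same for every query point, which is the lack of adaptivity pointed out above.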

The main depth-based classifiers available (among which those relying on the
max-depth approach of Liu et al. (1999) and Ghosh and Chaudhuri (2005b), as well as
the more efficient ones from Li et al. (2012)) suffer from the "outsider problem"³: if
the point $x$ to be classified does not sit in the convex hull of any of the two populations,
then most statistical depth functions will give $x$ zero depth with respect to each
population, so that $x$ cannot be classified through depth. This is of course undesirable,
all the more so that such a point $x$ may very well be easy to classify. To improve
on this, Hoberg and Mosler (2006) proposed extending the original depth fields by
using the Mahalanobis depth outside the supports of both populations, a solution
that quite unnaturally requires combining two depth functions. Quite interestingly,
our symmetrization construction implies that the depth-based $k$NN classifier (that
involves one depth function only) does not suffer from the outsider problem; this is
an important advantage over competing depth-based classifiers.

³ The term "outsider" was recently introduced in Lange et al. (2012).
While our depth-based classifiers in (3.1) are perfectly well-defined and enjoy, as
we will show in Section 3.2 below, excellent consistency properties, practitioners might
find it quite arbitrary that a point $x$ such that
$\sum_{i=1}^n I[Y_i = 1] W_i^{\beta(n)}(x) = \sum_{i=1}^n I[Y_i = 0] W_i^{\beta(n)}(x)$
is assigned to Population 0. Parallel to the standard $k$NN classifier, the
classification may alternatively be based on the population of the next neighbor. Since
ties are likely to occur when using depth, it is natural to rather base classification
on the proportion of data points from each population in the next depth region.
Of course, if the next depth region still leads to a tie, the outcome of the
classification is to be determined on the subsequent depth regions, until a decision
is reached (in the unlikely case that a tie occurs for all depth regions to be
considered, classification should then be done by flipping a coin). This treatment of
ties is used whenever real or simulated data are considered below.

Finally, practitioners have to choose some value for the smoothing parameter $k_n$.
This may be done, e.g., through cross-validation (as we will do in the real-data
examples of Section 5). The value of $k_n$ is likely to have a strong impact on
finite-sample performances, as confirmed in the simulations we conduct in Section 4.
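
A possible way to select $k_n$ by cross-validation is sketched below. The text does not specify the cross-validation scheme, so the leave-one-out version used here is an assumption, as are the function names and the candidate values. For brevity the sketch plugs in a Euclidean $k$NN rule, but any classifier with the same signature (such as a depth-based one) could be passed instead.

```python
import numpy as np

def euclidean_knn(x, X, y, k):
    """Standard Euclidean kNN rule, used here as a stand-in classifier."""
    d2 = ((X - x) ** 2).sum(axis=1)
    nearest = np.argsort(d2, kind="stable")[:k]
    return int(y[nearest].mean() > 0.5)

def loo_error(classify, X, y, k):
    """Leave-one-out misclassification rate of classify(x, X, y, k)."""
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i                 # drop the i-th observation
        errors += int(classify(X[i], X[keep], y[keep], k) != y[i])
    return errors / n

def select_k(classify, X, y, candidates):
    """Return the candidate k with the smallest leave-one-out error
    (the first one in case of equal errors), together with all errors."""
    errors = [loo_error(classify, X, y, k) for k in candidates]
    return candidates[int(np.argmin(errors))], errors

# Hypothetical, well-separated training data.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(30, 2)),
               rng.normal([10.0, 10.0], 1.0, size=(30, 2))])
y = np.repeat([0, 1], 30)
best_k, cv_errors = select_k(euclidean_knn, X, y, candidates=[1, 5, 15])
```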

3.2 Consistency results 

As expected, the local (nearest-neighbor) nature of the proposed classifiers makes
them consistent under very mild conditions. This, however, requires that the
statistical depth function $D$ satisfies the following further properties:

(Q1) continuity: if $P$ is symmetric about $\theta$ and admits a density that is positive
at $\theta$ and continuous in a neighborhood of $\theta$, then $x \mapsto D(x, P)$ is continuous
in a neighborhood of $\theta$.

(Q2) unique maximization at the symmetry center: if $P$ is symmetric about $\theta$ and
admits a density that is positive at $\theta$ and continuous in a neighborhood of $\theta$,
then $D(\theta, P) > D(x, P)$ for all $x \neq \theta$.

(Q3) consistency: for any bounded $d$-dimensional Borel set $B$,
$\sup_{x \in B} |D(x, P^{(n)}) - D(x, P)| = o(1)$ almost surely as $n \to \infty$, where
$P^{(n)}$ denotes the empirical distribution associated with $n$ random vectors that are
i.i.d. $P$.

Property (Q2) complements Property (P2), and, in view of Property (P3), only
further requires that $\theta$ is a strict local maximizer of $x \mapsto D(x, P)$. Note that
Properties (Q1)-(Q2) jointly ensure that the depth-based neighborhoods of $x$ from
Section 2.2 collapse to the singleton $\{x\}$ when the depth level increases to its maximal
value. Finally, since our goal is to prove that our classifier satisfies an asymptotic
property (namely, consistency), it is not surprising that we need to control the
asymptotic behavior of the sample depth itself (Property (Q3)). As shown by Theorem A.1
in the Appendix, Properties (Q1)-(Q3) are satisfied for many classical depth functions.

We can then state the main result of the paper.

Theorem 3.1 Let $D$ be a depth function satisfying (P2), (P3) and (Q1)-(Q3). Let $k_n$
be a sequence of positive integers such that $k_n \to \infty$ and $k_n = o(n)$ as
$n \to \infty$. Assume that, for $j = 0, 1$, $X \mid [Y = j]$ admits a density $f_j$ whose
collection of discontinuity points is closed and has Lebesgue measure zero. Then the
depth-based $k_n$NN classifier $\hat{m}_D^{(n)}$ in (3.1) is consistent in the sense that

$$P[\hat{m}_D^{(n)}(X) \neq Y \mid \mathcal{D}_n] - P[m_{\mathrm{Bayes}}(X) \neq Y] = o_P(1) \quad \text{as } n \to \infty,$$

where $\mathcal{D}_n$ is the sigma-algebra associated with $(X_i, Y_i)$, $i = 1, \ldots, n$.

Classically, consistency results for classification are based on a famous theorem
from Stone (1977); see, e.g., Theorem 6.3 in Devroye et al. (1996). However, it is an
open question whether Condition (i) of this theorem holds or not for the proposed
classifiers, at least for some particular statistical depth functions. A sufficient
condition for Condition (i) is actually that there exists a partition of $\mathbb{R}^d$ into
cones $C_1, \ldots, C_{\gamma_d}$ with vertex at the origin of $\mathbb{R}^d$ ($\gamma_d$ not
depending on $n$) such that, for any $X_i$ and any $j$, there exist (with probability one) at
most $k$ data points $X_\ell \in X_i + C_j$ that have $X_i$ among their $k$ depth-based nearest
neighbors. Were this to be established for some statistical depth function $D$, it would
prove that the corresponding depth-based $k_n$NN classifier $\hat{m}_D^{(n)}$ is universally
consistent, in the sense that consistency holds without any assumption on the distribution
of $(X, Y)$.

Now, it is clear from the proof of Stone's theorem that this Condition (i) may be
dropped if one further assumes that $X$ admits a uniformly continuous density. This
is however a high price to pay, and that is the reason why the proof of Theorem 3.1
rather relies on an argument recently used in Biau et al. (2012); see the Appendix.



4 SIMULATIONS 

We performed simulations in order to evaluate the finite-sample performances of the
proposed depth-based $k$NN classifiers. We considered six setups, focusing on bivariate
$X_i$'s ($d = 2$) with equal a priori probabilities ($\pi_0 = \pi_1 = 1/2$), and involving
the following densities $f_0$ and $f_1$:

Setup 1 (multinormality) $f_j$, $j = 0, 1$, is the pdf of the bivariate normal distribution
with mean vector $\mu_j$ and covariance matrix $\Sigma_j$, where $\Sigma_1 = 4\Sigma_0$
[the numerical values of $\mu_0$, $\mu_1$ and $\Sigma_0$ are illegible in the source];

Setup 2 (bivariate Cauchy) $f_j$, $j = 0, 1$, is the pdf of the bivariate Cauchy distribution
with location center $\mu_j$ and scatter matrix $\Sigma_j$, with the same values of $\mu_j$
and $\Sigma_j$ as in Setup 1;

Setup 3 (flat covariance structures) $f_j$, $j = 0, 1$, is the pdf of the bivariate normal
distribution with mean vector $\mu_j$ and covariance matrix $\Sigma_j$, where
$\Sigma_1 = \Sigma_0$ [the numerical values of $\mu_0$, $\mu_1$ and $\Sigma_0$ are illegible
in the source];

Setup 4 (uniform distributions on half-moons) $f_0$ and $f_1$ are the densities of
[two random vectors whose expressions are missing from the source], respectively, where
$U \sim \mathrm{Unif}(-1, 1)$ and $V \mid [U = u] \sim \mathrm{Unif}(1 - u^2, 2(1 - u^2))$;

Setup 5 (uniform distributions on rings) $f_0$ and $f_1$ are the uniform distributions on the
concentric rings $\{x \in \mathbb{R}^2 : 1 \leq \|x\| \leq 2\}$ and
$\{x \in \mathbb{R}^2 : 1.75 \leq \|x\| \leq 2.5\}$, respectively;

Setup 6 (bimodal populations) $f_j$, $j = 0, 1$, is the pdf of the multinormal mixture
$\frac{1}{2}\mathcal{N}(\mu_j^{I}, \Sigma_j^{I}) + \frac{1}{2}\mathcal{N}(\mu_j^{II}, \Sigma_j^{II})$
[the parameter values are missing from the source].

For each of these six setups, we generated 250 training and test samples of
size $n = n_{\mathrm{train}} = 200$ and $n_{\mathrm{test}} = 100$, respectively, and evaluated
the misclassification frequencies of the following classifiers:

1. the usual LDA and QDA classifiers (LDA/QDA);

2. the standard Euclidean $k$NN classifiers (kNN), with $\beta = k/n = 0.01$, $0.05$,
$0.10$ and $0.40$, and the corresponding "Mahalanobis" $k$NN classifiers (kNNaff)
obtained by performing the Euclidean $k$NN classifiers on standardized data,
where standardization is based on the regular covariance matrix estimate of the
pooled training sample;

3. the proposed depth-based $k$NN classifiers (D-kNN) for each combination of the
$k$ used in kNN/kNNaff and a statistical depth function (we focused on halfspace
depth, simplicial depth, or Mahalanobis depth);

4. the depth vs depth (DD) classifiers from Li et al. (2012), for each combination
of a polynomial curve of degree $m$ ($m = 1$, $2$, or $3$) and a statistical
depth function (halfspace depth, simplicial depth, or Mahalanobis depth). Exact
DD-classifiers (DD) as well as smoothed versions (DDsm) were actually
implemented, although, for computational reasons, only the smoothed version
was considered for $m = 3$. Exact classifiers search for the best separating
polynomial curve $(d, r(d))$ of order $m$ passing through the origin and $m$ "DD-points"
$(D_0^{(n)}(X_i), D_1^{(n)}(X_i))$ (see the Introduction) in the sense that it minimizes the
misclassification error

$$\sum_{i=1}^n \left( I[Y_i = 1] I[d_i^{(n)} > 0] + I[Y_i = 0] I[-d_i^{(n)} > 0] \right), \qquad (4.1)$$

with $d_i^{(n)} := r(D_0^{(n)}(X_i)) - D_1^{(n)}(X_i)$. Smoothed versions use derivative-based
methods to find a polynomial minimizing (4.1), where the indicator $I[d > 0]$ is
replaced by the logistic function $1/(1 + e^{-td})$ for a suitable $t$. As suggested in
Li et al. (2012), the value $t = 100$ was chosen in these simulations. One hundred randomly
chosen polynomials were used as starting points for the minimization algorithm,
the classifier using the resulting polynomial with minimal misclassification (note
that this time-consuming scheme always results in better performances than
the one adopted in Li et al. (2012), where only one minimization is performed,
starting from the best random polynomial considered).
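
The criterion (4.1) and its logistic smoothing can be written compactly as follows. This is only a sketch of the objective function, not of the search over polynomials: the parametrization of the curve by its coefficients (without constant term, so that it passes through the origin), the function name and the toy depth values are assumptions made for the illustration.

```python
import numpy as np

def dd_criterion(coefs, d0, d1, y, t=None):
    """Criterion (4.1) for the curve r(d) = sum_j coefs[j] * d**(j + 1).
    With t=None the exact indicator I[d_i > 0] is used; otherwise it is
    replaced by the logistic function 1 / (1 + exp(-t * d_i))."""
    powers = np.arange(1, len(coefs) + 1)
    r = (np.asarray(coefs, dtype=float) * d0[:, None] ** powers).sum(axis=1)
    d = r - d1                                   # d_i = r(D_0(X_i)) - D_1(X_i)
    if t is None:
        step = lambda z: (z > 0).astype(float)
    else:
        step = lambda z: 1.0 / (1.0 + np.exp(-t * z))
    return float(np.sum((y == 1) * step(d) + (y == 0) * step(-d)))

# Toy DD-points: Population 1 points are deep in Population 1 and shallow in
# Population 0, and conversely. The line r(d) = d separates them perfectly.
d0 = np.array([0.1, 0.2, 0.8, 0.9])   # depths with respect to Population 0
d1 = np.array([0.8, 0.9, 0.1, 0.2])   # depths with respect to Population 1
y = np.array([1, 1, 0, 0])
exact_value = dd_criterion([1.0], d0, d1, y)
smooth_value = dd_criterion([1.0], d0, d1, y, t=100.0)
flipped_value = dd_criterion([1.0], d0, d1, 1 - y)
```

With `coefs = [1.0]` the curve is the diagonal of the DD-plot, so the exact criterion then counts the errors of the max-depth rule.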
Since the DD classification procedure is a refinement of the max-depth procedures
of Ghosh and Chaudhuri (2005b) that leads to better misclassification rates (see Li
et al. (2012)), the original max-depth procedures were omitted in this study.
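
To make the simulation design concrete, the following sketch draws a sample from Setup 5 (uniform distributions on two concentric rings, with equal prior probabilities). The sampling scheme (square root of a uniform squared radius, combined with a uniform angle) is a standard way to obtain a uniform distribution on a ring; the function names and the seed are assumptions.

```python
import numpy as np

def sample_ring(n, r_in, r_out, rng):
    """Draw n points uniformly on the ring {x : r_in <= ||x|| <= r_out}."""
    radius = np.sqrt(rng.uniform(r_in ** 2, r_out ** 2, size=n))
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

def sample_setup5(n, rng):
    """Draw (X_i, Y_i), i = 1, ..., n, with P[Y = 0] = P[Y = 1] = 1/2."""
    y = rng.integers(0, 2, size=n)
    X = np.where((y == 1)[:, None],
                 sample_ring(n, 1.75, 2.5, rng),   # Population 1
                 sample_ring(n, 1.0, 2.0, rng))    # Population 0
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = sample_setup5(200, rng)
norms = np.linalg.norm(X_train, axis=1)
```

The two rings overlap for norms between 1.75 and 2, so no classifier can reach a zero misclassification rate in this setup.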

Boxplots of misclassification frequencies (in percentages) are reported in Figures 1
and 2. The main lessons from these simulations are the following:

• In most setups, the proposed depth-based $k$NN classifiers compete well with the
Euclidean $k$NN classifiers and improve over the latter under the flat covariance
structures in Setup 3. This may be attributed to the lack of affine-invariance of
the Euclidean $k$NN classifiers, which leads us to discard this procedure and rather
focus on its affine-invariant version (kNNaff). It is very interesting to note
that the kNNaff classifiers in most cases are outperformed by the depth-based
$k$NN classifiers. In other words, the natural way to make the standard $k$NN
classifier affine-invariant comes at a dramatic cost in terms of finite-sample
performances. Incidentally, we point out that, in some setups, the choice of the
smoothing parameter $k_n$ appears to have less impact on affine-invariant $k$NN
procedures than on the original $k$NN procedures; see, e.g., Setup 3.

• The proposed depth-based $k$NN classifiers also compete well with DD-classifiers
both in elliptical and non-elliptical setups. Away from ellipticity (Setups 4
to 6), in particular, they perform at least as well as, and sometimes outperform
(Setup 4), DD-classifiers; a single exception is associated with the use of
Mahalanobis depth in Setup 5, where the DD-classifiers based on $m = 2, 3$ perform
better. Apparently, another advantage of depth-based $k$NN classifiers over
DD-classifiers is that their finite-sample performances depend much less on the
statistical depth function $D$ used.
Figure 1: Boxplots of misclassification frequencies (in percentages), from 250
replications of Setups 1 to 3 described in Section 4, with training sample size
$n = n_{\mathrm{train}} = 200$ and test sample size $n_{\mathrm{test}} = 100$, of the LDA/QDA
classifiers, the Euclidean $k$NN classifiers (kNN) and their Mahalanobis
(affine-invariant) counterparts (kNNaff), the proposed depth-based $k$NN classifiers
(D-kNN), and some exact and smoothed versions of the DD-classifiers (DD and DDsm);
see Section 4 for details.

Figure 2: Boxplots of misclassification frequencies (in percentages), from 250
replications of Setups 4 to 6 described in Section 4, with training sample size
$n = n_{\mathrm{train}} = 200$ and test sample size $n_{\mathrm{test}} = 100$, of the LDA/QDA
classifiers, the Euclidean $k$NN classifiers (kNN) and their Mahalanobis
(affine-invariant) counterparts (kNNaff), the proposed depth-based $k$NN classifiers
(D-kNN), and some exact and smoothed versions of the DD-classifiers (DD and DDsm);
see Section 4 for details.

5 REAL-DATA EXAMPLES



In this section, we investigate the performance of our depth-based kNN classifiers on
two well-known benchmark datasets. The first example is taken from Ripley (1996)
and can be found on the book's website (http://www.stats.ox.ac.uk/pub/PRNN).
This dataset involves well-specified training and test samples, and we therefore simply
report the test set misclassification rates of the different classifiers included in the
study. The second example, blood transfusion data, is available at
http://archive.ics.uci.edu/ml/index.html. Unlike the first dataset, no clear partition
into a training sample and a test sample is provided here. As suggested in Li et al.
(2012), we randomly performed such a partition 100 times (see the details below) and
computed the average test set misclassification rates, together with standard deviations.
A brief description of each dataset is as follows:

Synthetic data was introduced and studied in Ripley (1996). The 
dataset is made of observations from two populations, each of them being 
actually a mixture of two bivariate normal distributions differing only in 
location. As mentioned above, a partition into a training sample and a test 
sample is provided: the training and test samples contain 250 and 1000 
observations, respectively, and both samples are divided equally between 
the two populations. 

Transfusion data contains the information on 748 blood donors selected
from the blood donor database of the Blood Transfusion Service Center
in Hsin-Chu City, Taiwan. It was studied in Yeh et al. (2009). The
classification problem at hand is to predict whether or not the donor gave
blood in March 2007. In this dataset, prior probabilities are not equal; out
of 748 donors, 178 gave blood in March 2007, whereas 570 did not. Following
Li et al. (2012), one out of two linearly correlated variables was removed
and three measurements were available for each donor: Recency (number
of months since the last donation), Frequency (total number of donations)
and Time (time since the first donation). The training set consists of 100
donors from the first class and 400 donors from the second, while the rest
is assigned to the test sample (therefore containing 248 individuals).
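
The repeated random partitioning described above can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated stand-ins that merely share the class counts (178 and 570) of the transfusion data, a plain Euclidean kNN rule stands in for the classifiers of the study, and the helper names `knn_predict` and `repeated_split_error` are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Plain Euclidean kNN majority vote (stand-in classifier for this sketch)."""
    # Pairwise squared distances, shape (n_test, n_train)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)
    return (votes > 0.5).astype(int)

def repeated_split_error(X, y, n1_train, n0_train, n_rep, k, seed=0):
    """Mean and standard deviation of test misclassification rates (in %)
    over n_rep random partitions with fixed per-class training sizes."""
    rng = np.random.default_rng(seed)
    idx1, idx0 = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    rates = []
    for _ in range(n_rep):
        tr = np.concatenate([rng.choice(idx1, n1_train, replace=False),
                             rng.choice(idx0, n0_train, replace=False)])
        te = np.setdiff1d(np.arange(len(y)), tr)   # everything else is test data
        pred = knn_predict(X[tr], y[tr], X[te], k)
        rates.append(100.0 * np.mean(pred != y[te]))
    return float(np.mean(rates)), float(np.std(rates, ddof=1)), len(te)

# Simulated stand-in data (NOT the real transfusion data), same class counts
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 1.0, size=(178, 3)),
               rng.normal(0.0, 1.0, size=(570, 3))])
y = np.concatenate([np.ones(178, dtype=int), np.zeros(570, dtype=int)])
mean_rate, sd_rate, n_test = repeated_split_error(X, y, 100, 400, n_rep=100, k=11)
print(n_test, round(mean_rate, 2), round(sd_rate, 2))
```

Fixing the per-class training sizes, rather than sampling 500 training points at random, keeps the class proportions of the training set identical across the 100 partitions.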

Table 1 reports the exact (synthetic) or averaged (transfusion) misclassification
rates of the following classifiers: the linear (LDA) and quadratic (QDA) discriminant
rules, the standard kNN classifier (kNN) and its Mahalanobis affine-invariant
version (kNNaff), the depth-based kNN classifiers using halfspace depth ($D_H$-kNN) and
Mahalanobis depth ($D_M$-kNN), and the exact DD-classifiers for any combination of
a polynomial order $m \in \{1, 2\}$ and a statistical depth function among the two
considered for depth-based kNN classifiers, namely the halfspace depth (DD$_H$) and the
Mahalanobis depth (DD$_M$). Smoothed DD-classifiers were excluded from this study,
as their performances, which can only be worse than those of exact versions, showed
much sensitivity to the smoothing parameter $t$; see Section 4. For all nearest-neighbor
classifiers, leave-one-out cross-validation was used to determine $k$.
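
Leave-one-out selection of $k$ can be sketched as below for the Euclidean kNN rule. The function name `loo_cv_choose_k`, the candidate grid, and the simulated two-class sample are assumptions made for illustration; the same loop would apply to any neighbor ordering, including a depth-based one.

```python
import numpy as np

def loo_cv_choose_k(X, y, k_grid):
    """Return the k in k_grid minimising the leave-one-out error of Euclidean kNN."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)          # a point is never its own neighbour
    order = np.argsort(d2, axis=1)        # neighbours of each point, closest first
    errors = []
    for k in k_grid:
        votes = y[order[:, :k]].mean(axis=1)
        pred = (votes > 0.5).astype(int)
        errors.append(np.mean(pred != y))
    errors = np.array(errors)
    return int(k_grid[int(np.argmin(errors))]), errors

# Simulated two-class sample (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)
k_grid = np.arange(1, 40, 2)              # odd values avoid voting ties
best_k, loo_errors = loo_cv_choose_k(X, y, k_grid)
print(best_k, loo_errors.min())
```

Sorting the neighbors once and reusing the ordering for every candidate $k$ keeps the cost at a single distance computation per sample.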

The results from Table 1 indicate that depth-based kNN classifiers perform very
well in both examples. For synthetic data, the halfspace depth-based kNN classifier
(10.1%) is only dominated by the standard (Euclidean) kNN procedure (8.7%). The
latter, however, has to be discarded as it is dependent on scale and shape changes; in
line with this, note that the "kNN classifier" applied in Dutta and Ghosh (2012b)
is actually the kNNaff classifier (11.7%), as classification in that paper is performed
on standardized data. The Mahalanobis depth-based kNN classifier (14.4%) does
not perform as well as its halfspace counterpart. For transfusion data, however, both
depth-based kNN classifiers dominate their competitors.
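
The dependence of the Euclidean kNN rule on scale and shape, and the invariance obtained by standardizing with the sample covariance matrix (the kNNaff construction), can be checked numerically. The sketch below is a minimal illustration on simulated data; the helper names and the affine map `A`, `b` are arbitrary choices, not taken from the study.

```python
import numpy as np

def mahalanobis_neighbors(X, x, k):
    """Indices of the k sample points closest to x in Mahalanobis distance,
    computed from the sample covariance matrix of X."""
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - x
    d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)
    return np.argsort(d2)[:k]

def euclidean_neighbors(X, x, k):
    """Indices of the k sample points closest to x in Euclidean distance."""
    return np.argsort(((X - x) ** 2).sum(axis=1))[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
x = np.array([0.3, -0.2])
A = np.array([[5.0, 1.0], [0.0, 0.2]])    # arbitrary invertible matrix
b = np.array([1.0, -3.0])
XA, xA = X @ A.T + b, A @ x + b           # affinely transformed sample and query

same_maha = set(mahalanobis_neighbors(X, x, 10)) == set(mahalanobis_neighbors(XA, xA, 10))
same_eucl = set(euclidean_neighbors(X, x, 10)) == set(euclidean_neighbors(XA, xA, 10))
print(same_maha, same_eucl)
```

The Mahalanobis neighbor set is unchanged by the affine map, whereas the Euclidean neighbor set generally is not.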

Classifier            Synthetic    Transfusion
LDA                     10.8       29.60 (0.9)
QDA                     10.2       29.21 (1.5)
kNN                      8.7       29.74 (2.0)
kNNaff                  11.7       30.11 (2.1)
$D_H$-kNN               10.1       27.75 (1.6)
$D_M$-kNN               14.4       27.36 (1.5)
DD$_H$ (m = 1)          13.4       28.26 (1.7)
DD$_H$ (m = 2)          12.9       28.33 (1.6)
DD$_M$ (m = 1)          17.5       31.44 (0.1)
DD$_M$ (m = 2)          12.0       31.54 (0.6)

Table 1: Misclassification rates (in %), on the two benchmark datasets considered
in Section 5, of the linear (LDA) and quadratic (QDA) discriminant rules, the
standard kNN classifier (kNN) and its Mahalanobis affine-invariant version (kNNaff), the
depth-based kNN classifiers using halfspace depth ($D_H$-kNN) and Mahalanobis depth
($D_M$-kNN), and the exact DD-classifiers for any combination of a polynomial degree
$m \in \{1, 2\}$ and a choice of halfspace depth (DD$_H$) or Mahalanobis depth (DD$_M$).
For the transfusion data, standard deviations over the 100 random partitions are
given in parentheses.

6 FINAL COMMENTS



The depth-based neighborhoods we introduced are of interest in other inference
problems as well. As an illustration, consider the regression problem where the conditional
mean function $x \mapsto m(x) = E[Y \mid X = x]$ is to be estimated on the basis of
mutually independent copies $(X_i, Y_i)$, $i = 1, \ldots, n$, of a random vector $(X, Y)$
with values in $\mathbb{R}^d \times \mathbb{R}$, or the problem of estimating the common
density $f$ of i.i.d. random $d$-vectors $X_i$, $i = 1, \ldots, n$. The classical $k_n$NN
estimators for these problems are

$$\hat f^{(n)}(x) = \frac{k_n}{n\,\mu_d(B_x^{\beta_n})} \quad\text{and}\quad \hat m^{(n)}(x) = \sum_{i=1}^n W_i^{(n)}(x)\, Y_i = \frac{1}{k_n} \sum_{i=1}^n \mathbb{I}[X_i \in B_x^{\beta_n}]\, Y_i, \qquad (6.1)$$

where $\beta_n = k_n/n$, $B_x^{\beta}$ is the smallest Euclidean ball centered at $x$ that contains
a proportion $\beta$ of the $X_i$'s, and $\mu_d$ stands for the Lebesgue measure on $\mathbb{R}^d$. Our
construction naturally leads to considering the depth-based $k_n$NN estimators $\hat f_D^{(n)}(x)$
and $\hat m_D^{(n)}(x)$ obtained by replacing in (6.1) the Euclidean neighborhoods $B_x^{\beta_n}$ with
their depth-based counterparts $R_x^{\beta_n}$, and $k_n = \sum_{i=1}^n \mathbb{I}[X_i \in B_x^{\beta_n}]$ with
$K_x^{\beta_n(n)} = \sum_{i=1}^n \mathbb{I}[X_i \in R_x^{\beta_n}]$.
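
For reference, the classical (Euclidean) estimators in (6.1) can be sketched numerically as follows. This is a minimal illustration on simulated Gaussian data; the function name `knn_density_and_regression` and the regression model used to generate `Y` are assumptions made for the example only.

```python
import math
import numpy as np

def knn_density_and_regression(X, Y, x, k):
    """Classical kNN estimates at x: density k / (n * volume of the smallest
    ball centred at x containing k sample points) and the mean response
    over those k points."""
    n, d = X.shape
    dist = np.sqrt(((X - x) ** 2).sum(axis=1))
    order = np.argsort(dist)
    radius = dist[order[k - 1]]           # distance to the k-th nearest neighbour
    # Lebesgue measure of a d-dimensional ball of the given radius
    volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d
    f_hat = k / (n * volume)
    m_hat = Y[order[:k]].mean()
    return f_hat, m_hat

rng = np.random.default_rng(0)
n, d = 5000, 2
X = rng.normal(size=(n, d))
Y = X[:, 0] + 0.1 * rng.normal(size=n)    # assumed regression function m(x) = x_1
x0 = np.array([0.5, 0.0])
f_hat, m_hat = knn_density_and_regression(X, Y, x0, k=200)
true_density = math.exp(-0.5 * (x0 @ x0)) / (2 * math.pi)
print(f_hat, true_density, m_hat)
```

A depth-based version would keep the same two formulas and only change how the neighborhood, its volume, and the number of points it contains are determined.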

A thorough investigation of the properties of these depth-based procedures is of
course beyond the scope of the present paper. It is, however, extremely likely that
the excellent consistency properties obtained in the classification problem extend to
these nonparametric regression and density estimation setups. Now, recent works
in density estimation indicate that using non-spherical (actually, ellipsoidal)
neighborhoods may lead to better finite-sample properties; see, e.g., Chacón (2009) or
Chacón et al. (2011). In that respect, the depth-based kNN estimators above are
very promising since they involve non-spherical (and, for most classical depths, even
non-ellipsoidal) neighborhoods whose shape is determined by the local geometry of
the sample. Note also that depth-based neighborhoods only require choosing a single
scalar bandwidth parameter (namely, $k_n$), whereas general $d$-dimensional ellipsoidal
neighborhoods require selecting $d(d+1)/2$ bandwidth parameters.
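
To make the notion of depth-based neighbors concrete, the sketch below ranks sample points by their Mahalanobis depth with respect to the sample symmetrized about the query point $x$, and classifies by majority vote. It is only an illustration under assumptions: the sample mean and covariance are used as location and scatter functionals, ties are ignored (so exactly $k$ neighbors are returned), and the function names are hypothetical.

```python
import numpy as np

def depth_neighbors_mahalanobis(X, x, k):
    """Indices of k depth-based neighbours of x, using Mahalanobis depth
    D(y) = 1 / (1 + (y - mu)' S^{-1} (y - mu)) computed on the sample
    symmetrised about x, i.e. on the 2n points X_i and 2x - X_i."""
    X_sym = np.vstack([X, 2.0 * x - X])
    mu = X_sym.mean(axis=0)                # equals x by construction
    S = np.cov(X_sym, rowvar=False)
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)
    depth = 1.0 / (1.0 + d2)
    return np.argsort(-depth)[:k], mu      # deepest points first

def depth_knn_classify(X, y, x, k):
    """Majority vote among the k depth-based neighbours of x."""
    idx, _ = depth_neighbors_mahalanobis(X, x, k)
    return int(y[idx].mean() > 0.5)

# Simulated two-class sample (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)
x = np.array([1.8, 2.1])
neighbors, center = depth_neighbors_mahalanobis(X, x, k=15)
label = depth_knn_classify(X, y, x, k=15)
print(label, center)
```

Only the single integer `k` has to be tuned; the orientation and elongation of the neighborhood come from the scatter of the sample about `x`, which differs from standardizing once with the global covariance matrix.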

A APPENDIX 

The main goal of this Appendix is to prove Theorem 3.1. We will need the following 
lemmas. 

Lemma A.1 Assume that the depth function $D$ satisfies (P2), (P3), (Q1), and (Q2).
Let $P$ be a probability measure that is symmetric about $\theta$ and admits a density that is
positive at $\theta$ and continuous in a neighborhood of $\theta$. Then, (i) for all $a > 0$, there exists
$\alpha < \alpha_* = \max_{x \in \mathbb{R}^d} D(x, P)$ such that $R_\alpha(P) \subset B_\theta(a) := \{x \in \mathbb{R}^d : \|x - \theta\| \le a\}$; (ii)
for all $\alpha < \alpha_*$, there exists $\xi > 0$ such that $B_\theta(\xi) \subset R_\alpha(P)$.

Proof of Lemma A.1. (i) First note that the existence of $\alpha_*$ follows from Property
(P2). Fix then $\delta > 0$ such that $x \mapsto D(x, P)$ is continuous over $B_\theta(\delta)$; existence
of $\delta$ is guaranteed by Property (Q1). Continuity implies that $x \mapsto D(x, P)$ reaches
a minimum in $B_\theta(\delta)$, and Property (Q2) entails that this minimal value, $\alpha_\delta$ say, is
strictly smaller than $\alpha_*$. Using Property (Q1) again, we obtain that, for each
$\alpha \in [\alpha_\delta, \alpha_*]$,

$$r_\alpha : \mathcal{S}^{d-1} \to \mathbb{R}^+, \qquad u \mapsto \sup\{r \in \mathbb{R}^+ : \theta + ru \in R_\alpha(P)\}$$

is a continuous function that converges pointwise to $r_{\alpha_*}(u) \equiv 0$ as $\alpha \to \alpha_*$. Since $\mathcal{S}^{d-1}$
is compact, this convergence is actually uniform, i.e., $\sup_{u \in \mathcal{S}^{d-1}} |r_\alpha(u)| = o(1)$ as
$\alpha \to \alpha_*$. Part (i) of the result follows.

(ii) Property (Q2) implies that, for any $\alpha \in [\alpha_\delta, \alpha_*)$, the mapping $r_\alpha$ takes values
in $\mathbb{R}_0^+$. Therefore there exists $u_0(\alpha) \in \mathcal{S}^{d-1}$ such that $r_\alpha(u) \ge r_\alpha(u_0(\alpha)) = \xi_\alpha > 0$
for all $u \in \mathcal{S}^{d-1}$. This implies that, for all $\alpha \in [\alpha_\delta, \alpha_*)$, we have
$B_\theta(\xi_\alpha) \subset R_\alpha(P)$, which proves the result for these values of $\alpha$. Nestedness of the
$R_\alpha(P)$'s, which follows from Property (P3), then establishes the result for an
arbitrary $\alpha < \alpha_*$. □
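
As a concrete instance of Lemma A.1 (an illustrative special case, not part of the proof), assume that $P$ has location $0$ and identity scatter, so that the Mahalanobis depth is $D(x, P) = 1/(1 + \|x\|^2)$ and $\alpha_* = 1$. The depth regions are then balls whose radius is available in closed form, and both parts of the lemma can be read off directly. The function names below are hypothetical.

```python
import numpy as np

def region_radius(alpha):
    """Radius of R_alpha(P) = {x : D(x, P) >= alpha} for the Mahalanobis depth
    D(x, P) = 1 / (1 + ||x||^2) of a distribution with mean 0 and identity scatter."""
    return np.sqrt(1.0 / alpha - 1.0)

def level_for_radius(a):
    """Depth level alpha < 1 whose region is exactly the ball of radius a (part (i))."""
    return 1.0 / (1.0 + a ** 2)

alphas = np.array([0.5, 0.9, 0.99, 0.999])
radii = region_radius(alphas)             # shrink towards 0 as alpha -> alpha_* = 1
a = 0.05
alpha_a = level_for_radius(a)
print(radii, alpha_a, region_radius(alpha_a))
```

Part (i) corresponds to `level_for_radius`: any target radius $a > 0$ is reached by some level below $\alpha_*$. Part (ii) corresponds to `region_radius` being strictly positive for every level below $\alpha_*$.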

Lemma A.2 Assume that the depth function $D$ satisfies (P2), (P3), and (Q1)-(Q3).
Let $P$ be a probability measure that is symmetric about $\theta$ and admits a density that is
positive at $\theta$ and continuous in a neighborhood of $\theta$. Let $X_1, \ldots, X_n$ be i.i.d. $P$ and
denote by $X_{\theta,(i)}$ the $i$th depth-based nearest neighbor of $\theta$. Let $K_\theta^{\beta_n(n)}$ be the number of
depth-based nearest neighbors in $R_\theta^{\beta_n}(P^{(n)})$, where $\beta_n = k_n/n$ is based on a sequence $k_n$
that is as in Theorem 3.1 and $P^{(n)}$ stands for the empirical distribution of $X_1, \ldots, X_n$.
Then, for any $a > 0$, there exists $n = n(a)$ such that
$\sum_{i=1}^{K_\theta^{\beta_n(n)}} \mathbb{I}[\|X_{\theta,(i)} - \theta\| > a] = 0$
almost surely for all $n \ge n(a)$.

Note that, while $X_{\theta,(i)}$ may not be properly defined (because of ties), the quantity
$\sum_{i=1}^{K_\theta^{\beta_n(n)}} \mathbb{I}[\|X_{\theta,(i)} - \theta\| > a]$ always is.

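Proof of Lemma A.2. Fix $a > 0$. By Lemma A.1, there exists $\alpha < \alpha_*$ such that
$R_\alpha(P) \subset B_\theta(a)$. Fix then $\bar\alpha$ and $\varepsilon > 0$ such that $\alpha < \bar\alpha - \varepsilon < \bar\alpha + \varepsilon < \alpha_*$. Theorem 4.1
in Zuo and Serfling (2000b) and the fact that $P_\theta^{(n)} \to P_\theta = P$ weakly as $n \to \infty$
(where $P_\theta^{(n)}$ and $P_\theta$ are the $\theta$-symmetrized versions of $P^{(n)}$ and $P$, respectively) then
entail that there exists an integer $n_0$ such that

$$R_{\bar\alpha + \varepsilon}(P) \subset R_{\bar\alpha}(P_\theta^{(n)}) \subset R_{\bar\alpha - \varepsilon}(P) \subset R_\alpha(P)$$

almost surely for all $n \ge n_0$. From Lemma A.1 again, there exists $\xi > 0$ such that
$B_\theta(\xi) \subset R_{\bar\alpha + \varepsilon}(P)$. Hence, for any $n \ge n_0$, one has that

$$B_\theta(\xi) \subset R_{\bar\alpha}(P_\theta^{(n)}) \subset B_\theta(a)$$

almost surely.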

Putting $N_n = \sum_{i=1}^n \mathbb{I}[X_i \in B_\theta(\xi)]$, the SLLN yields that $N_n/n \to P[B_\theta(\xi)] > 0$
as $n \to \infty$, since $X \sim P$ admits a density that, from continuity, is
positive over a neighborhood of $\theta$. Since $k_n = o(n)$ as $n \to \infty$, this implies that, for
all $n \ge \tilde n_0\ (\ge n_0)$,

$$\sum_{i=1}^n \mathbb{I}[X_i \in R_{\bar\alpha}(P_\theta^{(n)})] \ge N_n \ge k_n$$

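almost surely. It follows that, for such values of $n$,

$$R_\theta^{\beta_n}(P^{(n)}) = R^{\beta_n}(P_\theta^{(n)}) \subset R_{\bar\alpha}(P_\theta^{(n)}) \subset B_\theta(a)$$

almost surely, with $\beta_n = k_n/n$. Therefore, $\max_{i = 1, \ldots, K_\theta^{\beta_n(n)}} \|X_{\theta,(i)} - \theta\| \le a$ almost
surely for large $n$, which yields the result. □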

Lemma A.3 For a "plug-in" classification rule $\tilde m^{(n)}(x) = \mathbb{I}[\tilde\eta^{(n)}(x) > 1/2]$ obtained
from a regression estimator $\tilde\eta^{(n)}(x)$ of $\eta(x) = E[\mathbb{I}[Y = 1] \mid X = x]$, one has that
$P[\tilde m^{(n)}(X) \ne Y] - L_{\mathrm{opt}} \le 2\,(E[(\tilde\eta^{(n)}(X) - \eta(X))^2])^{1/2}$, where $L_{\mathrm{opt}} = P[m_{\mathrm{Bayes}}(X) \ne Y]$
is the probability of misclassification of the Bayes rule.

Proof of Lemma A.3. Corollary 6.1 in Devroye et al. (1996) states that

$$P[\tilde m^{(n)}(X) \ne Y \mid \mathcal{D}_n] - L_{\mathrm{opt}} \le 2\, E[\,|\tilde\eta^{(n)}(X) - \eta(X)| \mid \mathcal{D}_n],$$

where $\mathcal{D}_n$ stands for the sigma-algebra associated with the training sample $(X_i, Y_i)$,
$i = 1, \ldots, n$. Taking expectations in both sides of this inequality and applying
Jensen's inequality readily yields the result. □
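
The chain of inequalities in Lemma A.3 can be checked exactly on a small discrete example. The sketch below assumes $X$ uniform on a finite set of support points, a known regression function `eta`, and a fixed perturbed version `eta_tilde` playing the role of the estimator (so that no conditioning on a training sample is needed); all names and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000                                   # X uniform on m support points
eta = rng.uniform(size=m)                  # true eta(x) = P[Y = 1 | X = x]
eta_tilde = np.clip(eta + rng.normal(scale=0.2, size=m), 0.0, 1.0)  # noisy estimate

def risk(rule):
    """Exact misclassification probability of a 0/1 rule when X is uniform."""
    return np.mean(np.where(rule == 1, 1.0 - eta, eta))

bayes_rule = (eta > 0.5).astype(int)
plug_in_rule = (eta_tilde > 0.5).astype(int)
excess_risk = risk(plug_in_rule) - risk(bayes_rule)
l1_bound = 2.0 * np.mean(np.abs(eta_tilde - eta))             # Corollary 6.1-type bound
l2_bound = 2.0 * np.sqrt(np.mean((eta_tilde - eta) ** 2))     # after Jensen's inequality
print(excess_risk, l1_bound, l2_bound)
```

The excess risk only accumulates on points where the two rules disagree, and there $|\eta - 1/2| \le |\tilde\eta - \eta|$, which is why the $L_1$ bound holds; Jensen's inequality then gives the $L_2$ form used in the lemma.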

Proof of Theorem 3.1. From Bayes' theorem, $X$ admits the density $x \mapsto f(x) =
\pi_0 f_0(x) + \pi_1 f_1(x)$. Letting $\mathrm{Supp}_+(f) = \{x \in \mathbb{R}^d : f(x) > 0\}$ and writing $C(f_j)$ for
the collection of continuity points of $f_j$, $j = 0, 1$, put $N = \mathrm{Supp}_+(f) \cap C(f_0) \cap C(f_1)$.
Since, by assumption, $\mathbb{R}^d \setminus C(f_j)$ ($j = 0, 1$) has Lebesgue measure zero, we have that

$$P[X \in \mathbb{R}^d \setminus N] \le P[X \in \mathbb{R}^d \setminus \mathrm{Supp}_+(f)] + \sum_{j \in \{0,1\}} P[X \in \mathbb{R}^d \setminus C(f_j)] = \int_{\mathbb{R}^d \setminus \mathrm{Supp}_+(f)} f(x)\, dx = 0,$$

so that $P[X \in N] = 1$. Note also that $x \mapsto \eta(x) = \pi_1 f_1(x) / (\pi_0 f_0(x) + \pi_1 f_1(x))$ is
continuous over $N$.

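Fix $x \in N$ and let $Y_{x,(i)} = Y_{j(x)}$ with $j(x)$ such that $X_{x,(i)} = X_{j(x)}$. With this
notation, the estimator $\hat\eta_D^{(n)}(x)$ from Section 3.1 rewrites

$$\hat\eta_D^{(n)}(x) = \sum_{i=1}^n Y_i\, W_i^{\beta(n)}(x) = \frac{1}{K_x^{\beta(n)}} \sum_{i=1}^{K_x^{\beta(n)}} Y_{x,(i)}.$$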

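Proceeding as in Biau et al. (2012), we therefore have that (writing for simplicity $\beta$
instead of $\beta_n$ in the rest of the proof)

$$T^{(n)}(x) := E[(\hat\eta_D^{(n)}(x) - \eta(x))^2] \le 2\, T_1^{(n)}(x) + 2\, T_2^{(n)}(x),$$

with

$$T_1^{(n)}(x) = E\Big[\Big|\frac{1}{K_x^{\beta(n)}} \sum_{i=1}^{K_x^{\beta(n)}} \big(Y_{x,(i)} - \eta(X_{x,(i)})\big)\Big|^2\Big]$$

and

$$T_2^{(n)}(x) = E\Big[\Big|\frac{1}{K_x^{\beta(n)}} \sum_{i=1}^{K_x^{\beta(n)}} \big(\eta(X_{x,(i)}) - \eta(x)\big)\Big|^2\Big].$$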

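Writing $\mathcal{D}_X^{(n)}$ for the sigma-algebra generated by $X_i$, $i = 1, \ldots, n$, note that,
conditional on $\mathcal{D}_X^{(n)}$, the $Y_{x,(i)} - \eta(X_{x,(i)})$'s, $i = 1, \ldots, n$, are zero mean mutually independent
random variables. Consequently,

$$T_1^{(n)}(x) = E\Big[\frac{1}{(K_x^{\beta(n)})^2} \sum_{i=1}^{K_x^{\beta(n)}} E\big[(Y_{x,(i)} - \eta(X_{x,(i)}))^2 \mid \mathcal{D}_X^{(n)}\big]\Big] \le E\Big[\frac{4}{K_x^{\beta(n)}}\Big] \le \frac{4}{k_n} = o(1),$$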
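as $n \to \infty$, where we used the fact that $K_x^{\beta(n)} \ge k_n$ almost surely. As for $T_2^{(n)}(x)$, the
Cauchy-Schwarz inequality yields (for an arbitrary $a > 0$)

$$\begin{aligned}
T_2^{(n)}(x) &\le E\Big[\frac{1}{K_x^{\beta(n)}} \sum_{i=1}^{K_x^{\beta(n)}} \big(\eta(X_{x,(i)}) - \eta(x)\big)^2\Big] \\
&= E\Big[\frac{1}{K_x^{\beta(n)}} \sum_{i=1}^{K_x^{\beta(n)}} \big(\eta(X_{x,(i)}) - \eta(x)\big)^2\, \mathbb{I}[\|X_{x,(i)} - x\| \le a]\Big]
 + E\Big[\frac{1}{K_x^{\beta(n)}} \sum_{i=1}^{K_x^{\beta(n)}} \big(\eta(X_{x,(i)}) - \eta(x)\big)^2\, \mathbb{I}[\|X_{x,(i)} - x\| > a]\Big] \\
&\le \sup_{y \in B_x(a)} |\eta(y) - \eta(x)|^2 + 4\, E\Big[\frac{1}{K_x^{\beta(n)}} \sum_{i=1}^{K_x^{\beta(n)}} \mathbb{I}[\|X_{x,(i)} - x\| > a]\Big] \\
&=: \tilde T_2(x; a) + \bar T_2^{(n)}(x; a).
\end{aligned}$$
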

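Continuity of $\eta$ at $x$ implies that, for any $\varepsilon > 0$, one may choose $a = a(\varepsilon) > 0$ so
that $\tilde T_2(x; a(\varepsilon)) < \varepsilon$. Since Lemma A.2 readily yields that $\bar T_2^{(n)}(x; a(\varepsilon)) = 0$ for large $n$,
we conclude that $T_2^{(n)}(x)$ (hence also $T^{(n)}(x)$) is $o(1)$. The Lebesgue dominated
convergence theorem then yields that $E[(\hat\eta_D^{(n)}(X) - \eta(X))^2]$ is $o(1)$. Therefore, using
the fact that $P[\hat m_D^{(n)}(X) \ne Y \mid \mathcal{D}_n] \ge L_{\mathrm{opt}}$ almost surely and applying Lemma A.3, we
obtain

$$\begin{aligned}
E\big[\,\big|P[\hat m_D^{(n)}(X) \ne Y \mid \mathcal{D}_n] - L_{\mathrm{opt}}\big|\,\big]
&= E\big[P[\hat m_D^{(n)}(X) \ne Y \mid \mathcal{D}_n] - L_{\mathrm{opt}}\big] \\
&= P[\hat m_D^{(n)}(X) \ne Y] - L_{\mathrm{opt}} \le 2\,\big(E[(\hat\eta_D^{(n)}(X) - \eta(X))^2]\big)^{1/2} = o(1),
\end{aligned}$$

as $n \to \infty$, which establishes the result. □
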

Finally, we show that Properties (Q1)-(Q3) hold for several classical statistical 
depth functions. 

Theorem A.1 Properties (Q1)-(Q3) hold for (i) the halfspace depth and (ii) the
simplicial depth. (iii) If the location and scatter functionals $\mu(P)$ and $\Sigma(P)$ are such
that (a) $\mu(P) = \theta$ as soon as the probability measure $P$ is symmetric about $\theta$ and such
that (b) the empirical versions $\mu(P^{(n)})$ and $\Sigma(P^{(n)})$ associated with an i.i.d. sample
$X_1, \ldots, X_n$ from $P$ are strongly consistent for $\mu(P)$ and $\Sigma(P)$, then Properties (Q1)-(Q3)
also hold for the Mahalanobis depth.

Proof of Theorem A.l. (i) The continuity of D in Property (Ql) actually holds 
under the only assumption that P admits a density with respect to the Lebesgue 
measure; see Proposition 4 in Rousseeuw and Ruts (1999). Property (Q2) is a con- 
sequence of Theorems 1 and 2 in Rousseeuw and Struyf (2004) and the fact that 
the angular symmetry center is unique for absolutely continuous distributions; see 
Serfling (2006). For halfspace depth, Property (Q3) follows from (6.2) and (6.6) in
Donoho and Gasko (1992). 

(ii) The continuity of D in Property (Ql) actually holds under the only assumption 
that P admits a density with respect to the Lebesgue measure; see Theorem 2 in Liu 
(1990). Remark C in Liu (1990) shows that, for an angularly symmetric probability 
measure (hence also for a centrally symmetric probability measure) admitting a den- 
sity, the symmetry center is the unique point maximizing simplicial depth provided 
that the density remains positive in a neighborhood of the symmetry center; Prop- 
erty (Q2) trivially follows. Property (Q3) for simplicial depth is stated in Corollary 1 
of Dümbgen (1992).

(iii) This is trivial. □

Finally, note that Properties (Q1)-(Q3) also hold for projection depth under very 
mild assumptions on the univariate location and scale functionals used in the defini- 
tion of projection depth; see Zuo (2003). □ 